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Abstract 

The EM algorithm is a widely used methodology for penalized like- 
lihood estimation. Provable monotonicity and convergence are the hall- 
marks of the EM algorithm and these properties are well established for 
smooth likelihood and smooth penalty functions. However, many relaxed 
versions of variable selection penalties are not smooth. The goal of this 
paper is to introduce a new class of Space Alternating Penalized Kull- 
back Proximal extensions of the EM algorithm for nonsmooth likelihood 
inference. We show that the cluster points of the new method are sta- 
tionary points even when on the boundary of the parameter set. Special 
attention has been paid to the construction of component-wise version 
of the method in order to ease the implementation for complicated mod- 
els. Illustration for the problems of model selection for finite mixtures of 
regression and to sparse image reconstruction is presented. 

Keywords: EM Algorithm Maximum Likelihood Estimation Sparsity 
Model Selection Space Alternating Algorithm Nonsmooth Penalty. 

1 Introduction 

The EM algorithm of Dempster Laird and Rudin (1977) is a widely applica- 
ble methodology for computing likelihood maximizers or at least stationary 
points. It has been extensively studied over the years and many useful general- 
izationshave been proposed including for instance the stochastic EM algorithm 
of Dclyon, Laviclle and Moulincs (1999); Kuhn and Lavielle (2004); the PX-EM 
accelerations of Liu, Rubin and Wu (1998); the MM generalization of Lange and 
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Hunter (2004) and the recent approach using extrapolation such as proposed in 
Varadhan and Roland (2007). 

In recent years, much attention has been given to the problem of variable 
selection for multiparameter estimation, for which the desiered solution is spase, 
i.e. many of the parameters are zero. Several approaches have been proposed 
for recovering sparse models. The main contributions in this direction are sparse 
Bayes learning (Tipping ()), LASSO-like penalties penalized least squares (Tib- 
shirani (1996)), ISLE (Friedman and Popescu (2003)), information theoretic 
based prior methods of Barron (1999), empirical Bayes (Johnstone and Silver- 
man ()) and "hidden variable" -type approach developped by Figueiredo and 
Nowak (2003). Among recent and exciting alternatives is the new Dantzig se- 
lector of Candes and Tao (2008). Of particular interest are penalization methods 
by miximizing the log-likelihood function with a penalty for non-sparsity. Most 
approaches use non-differentiable penalization. Sec for example the paper of 
Candes and Plan (2008) for a very elegant analysis of the Zi-type penalization in 
the context of linear variable selection. On the other hand, only a few attempts 
have been made to use this type of penalization for more complex models than 
the linear model; for some recent progress, see Koh, Kim, and Boyd (2007) for 
the case of logistic regression; and Khalili and Chen (2007) for mixture models. 
However, the use of non-differentiable penalties can be reasonnably expected to 
be extended to more complex nonlinear models, the mixture model being one 
of the most popular instance. 

The goal of the present paper is to propose new extensions of the EM al- 
gorithm that incorporate a non-differentiable penalty at each step. Following 
previous work of the first two authors, we develop a Kullback Proximal frame- 
work for understanding the EM-iterations and prove optimality for the cluster 
points of the methods using nonsmooth analysis tools. A key additional feature 
in our study is the consideration of Space Alternating extensions of EM and 
Kullback Proximal Point (KPP) methods. Such component-wise versions of 
EM-type algorithms can enjoy nice theoretical properties with respect to accel- 
eration of convergence speed (Fessler and Hero (1994)). The KPP method was 
applied to gaussian mixture models in Cclcux et al. (2001). The main result 
of our paper is that any cluster point of the Space Alternating KKP method 
satisfies a nonsmooth Karush-Kuhn- Tucker necessary optimality equation. 

The paper is organized as follows. In section 2 we present the penalized 
EM-type methods that we call Penalized Kullback Proximal Point methods. 
In Section 3, our main asymptotic results are presented. In Section 4, two 
examples are presented. The first is a space alternating implementation of the 
penalized EM algorithm for a problem of model selection in a finite mixture of 
linear regressions using the SCAD penalty introduced in Fan and Li (2001) and 
further studied in Khalili and Chen (2007). The second example is taken from 
Ting, Raich and Hero, (2007) in which the theoretical issue of convergence was 
not addressed. New asymptotic results follow from the theory developped in 
Section 3. 



2 



2 The EM algorithm and its Kullback proximal 
generalizations 

The problem of maximum likelihood (ML) estimation consists of finding a so- 
lution of the form 

9ml = argmax eee l y (9), (1) 

where y is an observed sample of a random variable Y defined on a sample space 
y and l v {9) is the log-likelihood function defined by 

l y (6) = log g(y; 6), (2) 

defined on the parameter space 9 C M. p , and g(y; 9) denotes the density of Y at 
y parametrized by the vector parameter 9. 

The standard EM approach to likelihood maximization consists of introduc- 
ing a complete data vector X with density /. Consider the conditional density 
function k(x\y; 9) of X given y 

k(x\y;6) = ip§-. (3) 

As is well known, the EM algorithm then consists of alternating between two 
steps. The first step, called the E(xpectation) step consists of computing the 
conditional expectation of the complete log-likelihood given Y. Notice that the 
conditional density k is parametrized by the current iterate of the unknown 
parameter values, denoted here by 9 for simplicity. Moreover, the expected 
complete log-likelihood is a function of the variable 9. Thus the second step, 
called the M(aximization) step consists of maximizing the obtained expected 
complete log-likelihood with respect to the variable parameter 9. The maximizer 
is then accepted as the new current iterate of the EM algorithm and the two 
steps are repeated in a recursive manner until convergence is achieved. 

Consider now the general problem of maximizing a concave function 
The original proximal point algorithm introduced by Martinet (1970) is an it- 
erative procedure which can be written 

9 k+1 = argmax 9e ^ - ^\\9 - 9 k f} . (4) 

The influence of the quadratic penalty \\\9 — 9 k \\ 2 is controled by using a se- 
quence of positive parameters {Pk}- Rockafellar (1976) showed that superlinear 
convergence of this method is obtained when the sequence {(3k} converges to- 
wards zero. A relationship between Proximal Point algorithms and the EM 
algorithm was discovered in Chretien and Hero (2000) (see also Chretien and 
Hero (2007) for details). We review the EM analyogy to PPK methods. Assume 
that the family of conditional densities {k(x\y; 9)}g & R P is regular in the sense of 
Ibragimov and Khasminskii (1981), in particular k(x\y; 9)fi(x) and k(x\y; 9)[i(x) 
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are mutually absolutely continuous for any 9 and 9 in MP. Then the Radon- 
Nikodym derivative fc[^'g] exists for all 9, 9 and we can define the following 
Kullback Leibler divergence: 

«M) = E[,„ g ig|w]. (5) 

Let us define Di as the domain of l y , Di g the domain of I y (-,9) and Dj the 
domain of I y (-,-)- Using the distance- like function I y , the Kullback Proximal 
Point algorithm is defined by 

9 k+1 = argmax, eD# {$(0) - (3 k I y (9, 9)} . (6) 

The following was proved in Chretien and Hero (2000). 

Proposition 2.1 [Chretien and Hero (2000) Proposition 1]. The EM algorithm 
is a special instance of the Kullback-proximal algorithm with (3 k = 1, for all 
k e N. 

2.1 The Space Alternating Penalized Kullback-Proximal 
method 

In what follows, and in anticipation of component- wise implementations of pe- 
nalized KKP, the parameter space is decomposed into subspaces r = 6 n S r , 
r = 1, . . . , R where Si, . . . ,<Sr are subspaces of Mf and W = ©^ =1 S r . 
Then, our Penalized Proximal Point Algorithm is defined as follows. 

Definition 2.1 Let tp: K p m 5i x • • ■ x Sr be a continuously differ entiable 
mapping. Let (/3fc)fceN be a sequence of positive real numbers and X be a positive 
real vector in M. R . Let p n be a possibly nonsmooth penalty function with bounded 
Clarke- subdifferential (see the Appendix for details) on compact sets. Then, the 
Space Alternating Penalized Kullback Proximal Algorithm is defined by 

R 

9 k+1 = argmax e£0fe _ 1(mod a)+1 nAnu / , M ) ~ S ^MMty) ~ PklyW, k ). 

r=l 

(7) 

The standard Kullback-Proximal Point algorithms as defined in Chretien and 
Hero (2007) is obtained as special case by selecting R = 1, 6i = 6, A = 0. 

In most cases, the mappings ip r will simply be the projection onto the sub- 
space 9 r , r = 1, . . . , R. 

2.2 Notations and assumptions 

The notation || • || will be used to denote the norm on any previously defined space 
without more precision. The space on which it is the norm should be obvious 
from the context. For any bivariate function <&, Vi<& will denote the gradient 
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with respect to the first variable. In the remainder of this paper we will make 
the following assumptions. For a locally Lipschitz function /, df{x) denotes the 
Clarke subdifferential of / at x; this notion is recalled in the Appendix. 

Assumptions 1 (i) l y is differentiable on Di and l y {&) — 53r=i •VPn(V'r(0)) 
converges to — oo whenever \\6\\ tends to +00. 

(ii) the projection of Dj onto the first coordinate is a subset of Di. 

(iii) (/3fc)fcgN is a convergent nonnegative sequence of real numbers whose limit 
is denoted by 0* . 

(iv) the mappings if) r are such that 



for all d G S^r and e > sufficiently small so that 9 r + ed € 8, r — 1, . . . , R. 

We will also impose one of the two following sets of assumptions on the distance- 
like function I y . 

Assumptions 2 (i) There exists a finite dimensional euclidean space S, a dif- 
ferentiable mapping t : Di t—> S and a functional 'J : cSxShI such 
that KL divergence ((5]) satisfies 



where denotes the domain of 

(ii) For any {t k , t)k£^} C there exists p t > such that lim|| t fc_ t ||_ >00 I y (t k , t) > 
p t . Moreover, we assume that inf tS A/ p t > for any bounded set M C S. 

For all {t! , t) in , we will also require that 

(iii) (Positivity) > 0, 

(iv) (Identifiability) *(V, t) = ^ t = f , 

(v) (Continuity) ^ is continuous at (t',t) 

and for all t belonging to the projection of onto its second coordinate, 

(vi) (Differentiability) the function is differentiable at t. 

Assumptions 3 (i) There exists a differentiable mapping t : D\ 1— * M nxm such 
that the Kullback distance-like function I y is of the form 



where for all i and j, Uj is continuously differentiable on its domain of definition, 
otij is a function from y to R + , the set of positive real numbers, 

(ii) The function <fi is a non negative differentiable convex function defined for 
positive real numbers only and such that 0(r) = if and only if r = 1. 

(iii) There exists p > such that 



ip r (9 r + ed) = ip r (9 r ) 



(8) 



I v (0,9) = *(t(9)A8)), 




lim 4>{t) > p 



(iv) The mapping t is injective on each Q r . 
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In the context of Assumptions [3J Dj is simply the set 

Bi = {9 G K p | tij(6) > Vt G {l,...,n} and j G {1,.. .,to}} 2 . 

Notice that if Uj (9) — 9i and ot^ — 1 for all i and all j, the functions ij, turn out 
to reduce to the well known (f> divergence defined in Csiszar (1967). Assumptions 
[3] are satisfied by most standard examples (for instance Gaussian mixtures and 
Poisson inverse problems) with the choice <j){r) = rlog(r) — 1. 

Assumptions QJi) and (ii) on l y are standard and are easily checked in prac- 
tical examples, e.g. they are satisfied for the Poisson and additive mixture 
models. 

Finally we make the following general assumption. 

Assumptions 4 The Kullback ■proximal iteration ^ is well defined, i.e. there 
exists at least one maximizer of at each iteration k. 

In the EM case, i.e. (3 = 1, this last assumption is equivalent to the com- 
putability of M-steps. In practice it suffices to solve the inclusion G Vl y (0) — 
Xdp n (ip(9)) — (3k\7I y (9,9 k ) in order to prove in practice that the solution is 
unique. Then assumption [lji) is sufficient to conclude that we actually have a 
maximizer. 

3 Asymptotic properties of the Kullback-Proximal 
iterations 

3.1 Basic properties of the penalized Kullback proximal 
algorithm 

Under Assumptions [1] we state some basic properties of the penalized Kullback 
Proximal Point Algorithm. The most basic is the monotonicity of the penalized 
likelihood values taken by successive iterates and the boundedness of the penal- 
ized proximal sequence (9 k )k£fi- The proofs of the following lemmas are given, 
for instance, in Chretien and Hero (2000) for the case where A = and their 
generalizations to the present context is straightforward. 
We start with the following monotonicity result. 

Lemma 3.1 For any iteration k G N, the sequence (9 k )keN satisfies 

R R 

l y (9 k+1 )-Y, X rPn(M0 k+1 ))-(^ k )-Y, X ^M0 k ))) > W 
r—1 r—1 

(9) 

Lemma 3.2 The sequence (9 k )kem is bounded. 

The next lemma will also be useful and its proof in the case where A — is 
given in Chretien and Hero (2007) Lemma 2.4.3. The generalization to A > is 
also straightforward. 
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Lemma 3.3 Assume that in the Space Alternating KPP sequence (8 k )k^n, there 
exists a subsequence (9 a ^)keN belonging to a compact set C included in Di. 
Then, 

km p k I y (9 k+1 ,9 k ) = 0. 

k — >oo 

One important property which is observed in pratice and onto which stopping 
criteria often rely is that the distance between two successive iterates decreases 
to zero. This fact was established in Chretien and Hero (2007) and does not 
depend on the fact that A = 0. 

Proposition 3.1 [Chretien and Hero (2007) Proposition 4.1.2] The following 
statements hold. 

(i) For any sequence (0 k )keti in and any bounded sequence (r) k )ken in 
M+, the fact that limfc_> +00 I y (ri k ,9 k ) = implies limfc^ +00 \tij(r) k )—tij(9)\ = 
for all i,j such that otij =/= 0. 

(ii) If one coordinate of one of the two sequences (0 )feeN and (?7 tends 
to infinity, so does the other's same coordinate. 

3.2 Properties of cluster points 

The results of this subsection state that any cluster point 8* such that (9*, 8*) 
lies on the closure of Di satisfies some modified Karush-Kuhn- Tucker type con- 
ditions on the domain of the log-likelihood function. For notational convenience, 
we define 

R 

F p {9, 9) = l v {8) - KPniM®)) - (3I y (9,e). (10) 

We first establish this result in the case where Assumptions [2] hold in addition 
to Assumptions [1] and [3 for the Kullback distance- like function I y . 

Theorem 3.1 Assume that Assumptions]]^ [H and[^] hold and if R > 1, then 
for each r = 1, . . . , R t is infective on r . Let 8* be a cluster point of the Space 
Alternating Penalized Kullback-proximal sequence of Definition \2.1\ Assume 
that all the functions are differentiable at 8* . If 8* lies in the interior of D\ , 
then 9* is a stationary point of the log-likelihod function l y (9), i.e. 

R 
r=l 

Proof. We consider two cases, namely the case where R = 1 and the case where 
R> 1. 

A. If R = 1 the proof is exactly analog to the proof of Theorem 3.2.1 in 
Chretien and Hero (2007). In particular, we have 

F p ,{8*,8*) > F(8,8*) 

for all 8 such that (8,8*) e D/. Since I y (9,9*) is differentiable at 9* , the result 
follows by writing the first order optimality condition at 8* in (|11|) . 
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B. Assume that R > 1 and let (x"^ k ')kem be a subsequence of iterates of (|7) 
converging to 9* . Moreover let r = 1, . . . ,R and 9 € 6 r nD|. For each fc, let 
a r (k) the next index greater than a(k) such that (u(k) — 1) (mod i?) + 1 = r. 
Using the fact that t is injective on every 8 r , r — 1, . . . , R, Lemma |3~31 and the 
fact that ((3k)keN converges to (3* > 0, we easily conclude that (9 ar ^)keN and 
(0ovO)+i-) fegN a j gQ converge to 9*. 

For k sufficiently large, we may assume that the terms (Q^rik+i) ^ g<r r (k)^ anc j 
(9, 0°>W) belong to a compact neighborhood C* of (9*, 9*) included in L>/. By 
Definition 1 2. II of the Space Alternating Penalized Kullback Proximal iterations, 

F^ r(k) (e^)+\9^) > F PrrW (9,0 Mh) ). 

Therefore, 

F^(9°^+\9°^) ~((3 aAk) ~(3*)Iy(0 aAk)+ \0^)> 

Continuity of -Fjg follows directly from the proof of Theorem 3.2.1 in Chretien 
and Hero (2007) where in this proof a(k) has to be replaced by ay(fc)). This 
implies that 

Ff3*(9*,9*)> F(9,9*) 

for all 9 £ r such that (9,9*) e C* C\ Dj. Finally, recall that no assumption 
was made on 9, and that C* is any compact neighborhood of 9*. Thus, using 
the assumption QJi), which asserts that l y {9) tends to — oo as ||0|| tends to +oo, 
we may deduce that (fT2]) holds for any e 9 r such that (9,9*) € £>/ and, 
letting e tend to zero, we see that 9* maximizes Fp*(9,9*) for all 9 £ Q r such 
that (9,9*) belongs to Dj as claimed. 

To conclude the proof of Theorem 13.11 take d in MP and decompose d as 
d = d\ + • • • + df? with d r G <S r . Then, equation (|12p implies that the directional 
derivatives satisfy 

F' /3 ,(9*,9*;d r ) <0 (12) 

for all r = 1, . . . , R. Due to Assumption [1] (iv), the directional derivative of 
Sr=i ^rPnipr(')) m the direction d is equal to the sum of the partial derivatives 
in the directions d\ , . . . , da and since all other terms in the definition of Fp are 
diffcrentiable, we obtain using (fT2"| . that 

R 

F'p, (9* ,9*;d)=J2 F P* ( e * > d *-> d r)<0- (13) 

r=l 

Therefore, using characterization (prS)) in the Appendix of the subdifferential, 
the desired result follows. □ 
We now, consider the case where Assumptions [3] hold. 

Theorem 3.2 Assume that in addition to Assumptions^ and^ Assumptions 
[3] hold. Let 9* be a cluster point of the Space Alternating Penalized Kullback 



8 



Proximal sequence. Assume that all the functions tij are continuously differ- 
entiable at 9*. Let I* denote the index of the active constraints at 9* , i.e. 
I* = s.t. U : j(6*) = 0. If 9* lies in the interior of Di, then 9* satisfies 

the following property: there exists a family of subsets I* C I* and a set of real 
numbers A^ , (i, j) G I* , r = 1, . . . , R such that 

R R. 

OeVZ y (r)-^A^„(Vv(0*)) + ^ £ A*-P Sr (V%(0*)). (14) 

r=l r=l (ij)eZ;* 

Remark 3.1 The condition \1J$ ressembles the traditional Karush-Kuhn- Tucker 
conditions of optimality but are in fact weaker since the vector 

R 

E E ^p&(vt«(**)) 

r=i (i,i)ex;« 

may not belong to the normal cone at 9* to the set {6 \ tij > 0, i = 1, . . . , n, j = 
1, . . . ,m}. 

Proof of Theorem 13.21 Let &ij(9, 9) denote the bivariate function defined 

As in the proof of Theorem l3.11 let (x a ^)keN be a subsequence of iterates of 
converging to 9* . Moreover let r = 1, . . . , i? and € O r flD;. For each fc, let 
oy(fc) be the next index greater than a(k) such that (a r (k) — 1) (modi?) + 1 = r. 
Using the fact that t is injective on every 8 r , r = 1, . . . , R, Lemma 13731 and the 
fact that (/3k)keN converges to (3* > 0, we easily conclude that (9 ar ^)keN and 
(0Mfe)+l) feeN a i so converge to 9*. 

Due to Assumption [3] (iv), the first order optimality condition at iteration 
oy(fc) can be written 

= P Sr (Vly(9^ +1 )) - \ r g° Ak)+1 + A, r(fe) (E« o«(l/i)P5 t (Vt, i (^-W+ 1 )) 

(15) 

with 5 ^ (fc)+1 g a Pn Vr(^" (fc)+1 )). 

Moreover, Claim A in the proof of Theorem 4.2.1 in Chretien and Hero 
(2007), gives that for all (i,j) such that ctij(yj) ^ 

lim t lj {9 a ^ +1 )\/ 1 ^ l3 {9' I ^ +1 1 9' lAk ^) =0. (16) 

k — >+oo 

Let I* be a subset of indices such that the family {Ps T C^Uj(9*))}(i,j)ex* 1S 
linearly independent and spans the linear space generated by the family of all 
projected gradient {Ps T (^tij(6*))}i=i,...,nj=i,...,m- Since this linear indepen- 
dence and generating properties are preserved under small perturbations using 
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continuity of the gradients, we may assume without loss of generality that the 
family 

{p 5r (vM^ (fc)+1 ))V , 

is linearly independent for k sufficiently large. For such fc, we may thus rewrite 
equation (TK))) as 

= P 5r (VZ y (^ W+1 )) - K9r Ak)+l + (3 aAk) (E W)6 i; < r(fe)+1 fe) 

P5 r (Vt y (^« +1 )) + E« a y -(%)t y -(^" (fe)+1 )P5 I (V 1 <i>(r-W+ 1 ,r-( fe )))). 

(17) 

Claim. The sequence {tt^^^ 1 (yj)}keN has a convergent subsequence for 
all (j, j) in I* . 

Proof of the claim. Since the sequence (# fc )fcgN is bounded, ip is continu- 
ously differentiable and the penalty p n has bounded subdifferential on compact 
sets, there exists a convergent subsequence (gr r ^ J ^ k ^ +1 )keti with limit g*. Now, 
using Equation (|16p . this last equation implies that the {^"u +1 {Vj)}(i ,i)£X* 
converges to the coordinates of a vector in the linearly independent family 
{Ps r C^tij(^*))}(i,j)i£i* which concludes the proof. □ 

This claim allows us to finish the proof of Theorem l3.2l Since a subsequence 
(Uj))(i,j)ei* i s convergent, we may consider its limit 
Passing to the limit, we obtain from equation (| 1 5[) that 

O = P Sr (VZ !/ (0*))-A r <7* + /3*( Y, <i p s,(Vty(0*))). (18) 

Using the outer semi-continuity property of the subdifferential of locally Lip- 
schitz functions (see Appendix) we thus obtain that g* £ dp n ip r (0*)). Now, 
summing over r in (|18p . we obtain 

R R R. 

= E p ^( v ^( r ))-E A ^+/3*E( E 4- p *(v*ii(fl*)))- 

r=l r=l r=l 

Moreover, since ^. lJ (9 tTr ^ +1 1 9 ar( - k ^) tends to zeros if $ I*, i.e. if is 
not active, passing to the limit in equation equation (|15p implies that 

o=EPs r (vun)-EA r ^+/?*E( S 4 p ft(v*«(^))) 

r=l r=l r=l 

for /** being the subset of active indices of I*, i.e. I** = I* 1*. Since 
Ef=i ^ffr e Er=i ^rdp n (ipr{8*)), this implies that 

oe v^(r)-^A r 3 Pn (Vv(0*)) + /3*E ^ 4- p 5r(v*ii(»*)). (19) 

r=l r=l (ij')el;* 
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which establishes Theorem 13.21 once we take X*j — A*^-. □ 
The result (fl9|) can be refined to the classical Karush-Kuhn- Tucker type 
condition under additional conditions such as stated in the next corrolary. 

Corollary 3.1 If in addition to the assumptions of Theorem \ 3.°A we assume 
that either Ps r (V%(0*)) = Vtij(0*) or Ps r (Vtij(0*)) = for all € I*, 

i.e. such that Uj(6*) = 0, then there exists a set of subsets I* C I* and a 
family of real numbers \j , € X* , r = 1, . . . , R such that the following 

Karush-Kuhn- Tucker condition for optimality holds at cluster point 9* : 

R R 

o g vi y (9*) - Y, Kd Pn (Ms*)) + E E A *i v %(0*)- 
4 Examples 

In this section, we show two applications of the penalized KKP algorithm ([7]) 
to enforce sparsity in a multiple parameter estimator. We first consider the 
finite mixtures of linear regression models using the SCAD penalty for variable 
selection as studied in Khalili and Chen (2007). We then address a problem in 
sparse image reconstruction studied in Ting, Raich and Hero (2007). 

4.1 Variable selection in finite mixtures of regression mod- 
els 

Until quite recently, variable selection in regression models was performed 
using penalization strategies in the maximum likelihood framework, e.g. using 
AIC, Akaike (1973) and BIC, Schwarz (1978) for instance. The main drawback 
of these approaches is the combinatorial explosion of the set of possible models 
in the case where the number of variables is large. Recently, new approaches 
have been proposed that select the subsets of variables without enumeration of 
all subsets of a given size. Most such methods use Zi-type penalties of likelihood 
functions as in the LASSO, Tibshirani (1996). The recent Dantzig selector of 
Candes and Tao (2007) also uses the l\ penalty. 

Computation of the maximizers of the penalized likelihood can be performed 
using standard algorithms for nondifferentiable optimization such as bundle 
methods as introduced in Hiriart-Urruty and Lemarechal (1993). However gen- 
eral purpose optimization methods might be difficult to implement in the situ- 
ation where, for instance, log functions induce line-search problems. In certain 
cases, the EM algorithm or its generalizations, or a combination of EM type 
methods with general purpose optimization routines might be simpler to imple- 
ment. Variable selection in finite mixture models, as described in Khalili and 
Chen (2007), is such a case due to the presence of very natural hidden variables. 

In the finite mixture estimation problem considered here, y\ , . . . , y n are real- 
izations of the response variable Y and the associated realizations 
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of the P-dimensional vector of covariates X. We focus on the case of a mixture 
of linear regression models sharing the same variance as in the baseball data 
example of section 7.2 in Khalili and Chen (2007), i.e. 



K 



Y ~^ 7rfe ^(X i /3 fc)( r 2 ), (20) 



fc=i 



with 7Ti, . . . , 7Tfc > and 7Tj. = 1. The main problem discussed in Khalili 

and Chen (2007) is model selection for which a generalization of the smoothly 
clipped absolute deviation (SCAD) method of Fan and Li (2001,2002) is pro- 
posed using an MM-EM algorithm in the spirit of Hunter and Lange (2004). 
No convergence property of the MM algorithm is established. The purpose of 
this section is to show that the Space Alternating KPP EM genaralization is 
easily implemented and that stationarity of the cluster points is garanteed by 
the theoretical analysis of Section [3] 

The SCAD penalty, studied in Khalili and Chen (2007) is a modification of 
the l\ penalty which is given by 

k p 

Pn(Pl, ■ ■ ■ ,Pk) =^2 7r k^2p lnk (Pk,j) (21) 

fc=l J=l 

where p n u is specified by 

P^M = TnfcVnlvBl/9|<7ni, + —y l -MP\>lnu ( 22 ) 

for (3 in R. 

Define the complete data as the class indices z\ , . . . , z n of the mixture com- 
ponent from which the observed data point y n was drawn. The complete log- 
likclihood is then 

h(Pi, ...,0k, v 2 ) = E log(^) - \ log(2 7 ra 2 ) - ^~ jf*^ , (23) 

i— 1 ~ 

Setting 6 — (wi, . . . , ttk, (3i, ■ ■ ■ , (3k, & 2 ), the penalized Q-function is given by 



n K r -, / Ja \2l 



,0) = 



i=l k=l 

where 



l0g(7Tfc) - -l0g(27TCT 



1 ,__,o__a, (Vi - x lPkY 



2 bv ; 2a 2 



-Pn(Pl, ■ ■ -,Pk) 

(24) 



t ik (0) = ^ ^ — '-^ . (25) 

2^=1 ^ ex p ^ 2^ ; 

The computation of this Q-function corresponds to the E-step. Due to the 
fact that the penalty p n is a function of the mixture probabilities tt^ , the M-step 
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estimate of the tt vector is not given by the usual formula 



1 - 

= ~X>fc(*) k=l,...,K, (26) 



although this is the choice made in Khalili and Chen (2007) in their imple- 
mentation. Moreover, optimizing jointly in the variables fa and tt^ is clearly 
a more complicated task than independently optimizing with respect to each 
variable. We implement a componentwise approach consisting of successively 
optimizing with respect to the tt^ 's and alternatively with respect to each vector 
fa. Optimization with respect to the tt^'s can be easily performed using any 
standard optimization routine and optimization with respect to the fa's requires 
a specific algorithm for optimization of non-differentiable functions as provided 
by the function optim of Scilab using the 'nd' (standing for 'non-differentiable') 
option. 

We now turn to the description of the Kullback proximal penalty I y defined 
by ([5]). The conditional density function k{y\, . . . , y n , Z\, . . . , z n | y\, . . . ,y n ;0) 
is 

n 

k(yi,...,y n ,zi,...,Zn | yi, . . . ,y n ;9) = Y[t iZi (0). 

i=i 

and therefore, the Kullback distance- like function I y (9,6) is 



n K 



i=l k=l K ' 

We have R = K + 1 subsets of variables with respect to which optimization 
must be performed successively. All components of assumptions [1] and [3] are 
trivially satisfied for this model except for Assumption [3] (iv) . However As- 
sumption 0] (iv) is proved in Lemma 1 of Celeux et al. (2001). On the other 
hand, since Uk(9) = implies that tt^ = and tt^ = implies 

|!r(*) = ° (28) 

for all j = 1 , . . . , p and I = 1 , . . . , K and 

=0, (29) 

it follows that Ps r (VUk(0*)) = V^fc(0*) if S r is the vector space generated by 
the probability vectors tt and Ps r (ytik{9*)) = otherwise. Therefore, Corollary 
13. II applies. 

We now turn to some experiments on the real data set (available at 
http://www. amstat. org /publications /jse/v6n2/ datasets.watnik. html ) . 
Khalili and Chen (2007) report that a model with only two components was 
selected by the BIC criterion in comparision to the three components model. 
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5 10 15 20 25 30 35 





Figure 1: Baseball data of Khalili and Chen (2007). This experiment is per- 
formed with the plain EM. The parameters are j n k — .1 and a — 10. The first 
plot is the vector (3 obtained for the single component model. The second (resp. 
third) plot is the vector of the optimal j3\ (resp. fo)- The fourth plot is the 
euclidean distance to the optimal 8* versus iteration index. The starting value 
of 7Ti was .3 



14 



10 20 30 40 50 60 70 80 



Figure 2: Baseball data of Khalili and Chen (2007). This experiment is per- 
formed with the plain EM. The parameters are 7„fc = 5 and a = 10. The plot 
shows the probability ttx of the first component versus iteration index. The 
starting value of w± was .3 

Here, two algorithms are compared: the approximate EM using (|26p and the 
plain EM using the optim subroutines. The results for 7„fc = 1 and a = 10 are 
given in Figures [T] 

The experiments shown in Figure [T] that the approximate EM algorithm has 
similar properties to the plain EM algorithm for small values of the threshold 
parameters "f n k- Moreover, the larger the values of 7„/ c , the closer the probability 
of the first component is to 1. One important fact to notice is that with the 
plain EM algorithm, the optimal probability vector becomes singular, in the 
sense that the second component has zero probability as shown in Figure [2] 
(we fixed a maximum upper bound equal to .99 in order to avoid numerical 
problems). Figure [3] demonstrates that this behavior is not reproduced by the 
approximate EM algorithm chosen by Khalili and Chen (2007). 

4.2 h/h penalized EM for sparse image reconstruction 

In this section, we justify the convergence of the ^-penalized EM algorithm 
of Ting, Raich and Hero (2007). 

The image will be denoted by 9 € MP and the main problem is to reconstruct 
this image from a set of noisy measurements y G M. N , e.g. a set of noisy projec- 
tions of the image. We assume that the image 9 is sparse, i.e. the number of 
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Figure 3: Baseball data of Khalili and Chen (2007). This experiment is per- 
formed with the approximate EM. The parameters are 7„fc = 5 and a = 10. 
The plot shows the probability tti of the first component versus iteration index. 
The starting value of iri was .3 

nonzero pixels is small compared to the size of 9. The projection y is obtained 
from 9 via a linear transformation with additive gaussian white noise 



where w ~ 7V(0, a 2 1) and where H e~R Nxd . 

One very successfull method for reconstruction of sparse signals is the LASSO. 
This method was first proposed in Alliney and Ruzinsky (1994) and then further 
developped in Tibshirani (1996). The LASSO estimator is given by 



where (3 is a regularization parameter that can be tuned manually. We will 

denote by p(y \ 9) the density of Y whose log-likelihood is — g^a^" ■ 

In Ting, Raich and Hero (2007), the authors propose a more general frame- 
work allowing for more general possible penalties and in particular more tractable 
forms of the LAZE prior of Johnstone and Silverman (2004). The prior incor- 
porated in the Ting Raich and Hero model is as follows. For each i — 1, . . . , d 
define the random variables 9t and /; such that 9i = 9J-i with the following 



y = H9 + w 



(30) 



§(y,(3) = a,rgmin e \\H6-y\\ 2 2 +0\\6\\ 1 



(31) 
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density 



JO with probability (1 — w) 
I 1 with probability w 

^lyjf 1 :^ (33) 

where g(-) is some p.d.f. that will be specified later on and where it is assumed 
that the sequence {(9i,Ii)} is i.i.d. The variables {h} play the role of the 
delta function in the standard LAZE prior. The Maximum A Posteriori (MAP) 
reconstruction problem is then given by 

(§, I, w, a) = argmax£ |J|lU|a logp(0, 1 \ Y, w, a). (34) 

Let T\ = {i | Ii = 1} and Iq = {i \ h — 0}. The MAP problem is equivalent to 

max- 11 ^ 11 ' +(M-Card(X 1 ))log(l-w)+Card(Xi)logw , . 

Maximization is performed in a block coordinate-wise fashion, handling max- 
imization over (w,a) and (9,1) by alternating between them as described in 
Ting, Raich and Hero (2007) Section IV, Algorithm 1. 

Two options are considered MAPI and MAP2; we refer to Ting, Raich and 
Hero (2007) for more details. We only present MAPI since our results readily 
apply to MAP2 once case MAPI has been justified. Set g(x) = j(x, a) where 
|ae~°l x l is the Laplacian p.d.f. Maximization over (w, a) is easily obtained by 

a=^-and^=^#l. (36) 
11% 

The M step with respect to (9, 1) is obtained by applying the EM strategy. For 
this purpose, introduce the complete data model 

Z = 9 + wi (37) 
Y = HZ + w 2 (38) 

where w\ and w 2 are gaussian white noises with w\ ~ Af(0, a 2 1) and w = 
Hwi + W2 which implies that w 2 ~ N(Q,cr 2 I — oPHH 1 ). This representation 
is very interesting since it allows to decompose the problem in two tasks, the 
first being the one of deconvolving, the second of denoising. In this model, the 
hidden data is z = 9 + aw. Assuming (w, a) fixed, the E-step of the algorithm 
is obtained as follows from Figueiredo and Nowak (2003), Section V.B. We 
can write the complete likelihood fy,z\e = fy\z,efz\e- First, given Z, Y is 
independent of 9. Next, we have 

at n ynt y 

\og f zle (Z)= ^ +C (39) 
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where C is a constant dependent on 9. Hence, finding the function Q(9,0), is 
equivalent to replacing Z by its conditional expectation given Y and 9. This 
conditional expectation is a deconvolution step and is given by 



E 



Z I Y, 



a 



H\Y - HO). 



therefore, the E-step corresponds to the computation of 



E-step ) 



2a 2 



10-E 



Z I Y, 



9*9 - 20*E 



Z I Y, 



2a 2 



(40) 



-a (4i) 



Now, recalling that 9i = 9iU : the M step is 



( M-step ) 9i 



T hy (E[Z\Y,9};< 



2a 2 log (±^,aa 2 ) if < w < \ 



T S (E[Z | Y,9]-aa 2 ) 
where Th v denotes the hybrid soft thresholding function 

T hy (x,t 1 ,t 2 ) = (x - sign(x)i2l| x |>ti) 
and T s denotes the usual soft thresholding function 

T s (x,t) = T hy (x,t,t). 



if \ < w < 1 



(42) 
(43) 

(44) 



Finally, we obtain an EM algorithm which satisfies the assumptions of The- 
orem GQ] above and the asymptotic properties asserted by Theorem 13 . 1 1 hold . 



5 Appendix: The Clarke sub differential of a lo- 
cally Lipschitz function 

Since we are dealing with non differentiable functions, the notion of generalized 
differentiability is required. The main references for this appendix are Clarke 
(1990) and Rockafellar and Wets (2004). A locally Lipschitz function /: W i-> M 
always has a generalized directional derivative f°{9,u>): W x W i— ► K in the 
sense given by Clarke, i.e. 

f (9,uj) = hm sup^ eRp ^ e no . (45) 

The Clarke subdifferential of / at 9 is the convex set defined by 

df{6)={r 1 \f {6,u>)>r l t u, V^}. (46) 



Proposition 5.1 The function f is differentiable if and only if df(9) is a sin- 
gleton. 
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We now introduce another very important property of the Clarke subdifferential 
related to generalization of scmicontinuity for set-valued maps. 

Definition 5.1 A set-valued map <E> is said to be outer- semicontinuous if its 
graph 

graph * = {(0, fl ) \g e $(0)} (47) 

is closed, i.e. if for any sequence graph<f> D {0 n ,g n ) — > (#*,<7*) as n — > +oo, 
t/ien (0*,g*) e graph$. 

One crucial property of the Clarke subdifferential is that it is outer-semicontinuous. 
A point is said to be a stationary point of / if 

6 8/(9). (48) 

Consider now the problem 

sup (49) 

subject to 

«7(0) = [ffi(0),...,0 m (0)]*>O (50) 

where all the functions are locally Lipschitz from R p to R. Then, a necessary 
condition for optimality of 9 is the Karush-Kuhn- Tucker condition, i.e. there 
exists a vector u e R+ such that 

m 

oed/(0) + 5>^ 5j (0). (51) 

Convex functions are in particular locally Lipschitz and the notions of subdiffer- 
ential can the Clarke subdifferential is well defined. The main references on their 
particular properties are Rockafcllar (1970) and Hiriart-Urruty and Lemarechal 
(1993). 
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